Week 9.6 - Hands-On Activities and Assessment

🎯 What We'll Cover

Three activities that operationalise the trajectory frame in your own research domain, and a final assessment that asks you to produce a dated snapshot of AI capability and limitation in your field. The deliverable is explicitly time-stamped and is expected to go stale within months. That is deliberate: the skill being taught is continuous re-calibration, not the ability to write a static report.

🎯 Activity 1: Hallucination Hunting

This activity is familiar from earlier weeks (Week 5's citation hunt; Week 8's self-recorded transcription test). Here you apply it to your own domain in 2026.

📝 Brief

Choose a current frontier model: Claude Opus 4.7, GPT-5.5 (or Pro), Gemini 3.1 Pro, or DeepSeek V4 Pro. Try, deliberately, to induce failures in your own discipline.

Prompts to consider:

Ask for a citation to a recent paper on a niche sub-topic in your field. Verify the paper exists and that the cited claim matches.
Ask for the proof of a result you know well. Check whether the proof is correct or contains plausible-but-wrong steps.
Ask for the contemporary consensus on a debated question in your field. Check whether the model represents the debate accurately or flattens it into a single position.
Ask for a summary of a specific paper's findings. Compare to the actual paper.
Ask the model to apply a method from your field to a problem outside its usual domain. See whether it transfers correctly.

Document each failure: the prompt, what the model produced, and what was wrong with it. Categorise each one per the 9.2 taxonomy: patched (you're using an old model), reduced-but-persistent (frontier behaviour but mitigatable), or structural (won't go away with the next release).

Deliverable: a structured table of at least five distinct failures in your domain, with prompt, output, what was wrong, and category.
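
One way to keep the table consistent is to log each failure as a structured record and export the lot as CSV. The sketch below is illustrative only: the class name, field names, and the "distinct prompt" test for distinctness are assumptions, not a required format. The three category labels are the ones from the 9.2 taxonomy.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date

# Category labels taken from the 9.2 taxonomy.
CATEGORIES = {"patched", "reduced-but-persistent", "structural"}


@dataclass
class FailureRecord:
    """One row of the failure table (field names are hypothetical)."""
    model: str            # model name and version as you accessed it
    tested_on: date       # the day you ran the prompt
    prompt: str
    output: str           # verbatim or faithfully excerpted model output
    what_was_wrong: str
    category: str

    def __post_init__(self):
        # Reject anything outside the three taxonomy categories.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def to_csv(records):
    """Render the records as CSV text, one row per documented failure."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(FailureRecord)]
    )
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["tested_on"] = record.tested_on.isoformat()
        writer.writerow(row)
    return buffer.getvalue()


def meets_minimum(records, minimum=5):
    """Assumption: two failures count as distinct if their prompts differ."""
    return len({r.prompt for r in records}) >= minimum
```

Recording the model version and test date on every row is what lets you rerun the same prompts later and see which failures have moved category.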

🚀 Activity 2: Capability Hunting

This activity is new. Hallucination hunting documents what AI fails at; capability hunting documents what it can do that you didn't expect. The pairing is the point: one assessment covers both halves of the calibration.

📝 Brief

Choose a task in your own field that “common knowledge” (or your own intuition) says AI can't handle. Test current frontier models on it. Document what actually happened.

Some prompts for thinking about what to test:

A type of analysis you assumed required hand-tuning
A class of problem your field considers “AI-resistant”
A skill that the literature claims AI can't do (date-check the claim)
A task where you have ground truth (your own published work; a benchmark question with a known answer)
A task that requires combining knowledge from two sub-fields

Cross-model triangulation is required. Test the same task on at least two different frontier models. Document agreement and disagreement. The disagreements are particularly informative.

Deliverable: a write-up of at least three capability tests, with the assumption you held going in, the prompts you used, what each model produced, your verification of the outputs, and your updated calibration.
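
A minimal sketch of how the agreement and disagreement could be tabulated for one task. It assumes you have already reduced each model's output, by hand, to a short comparable answer (a number, a yes/no verdict, a named method); the function name and the model labels in the usage example are hypothetical.

```python
from collections import defaultdict


def triangulate(answers):
    """Group models by the answer they gave for a single task.

    `answers` maps a model label to the short answer you extracted
    from that model's output. Comparison ignores case and surrounding
    whitespace, nothing more.
    """
    if len(answers) < 2:
        raise ValueError("triangulation needs at least two models")
    groups = defaultdict(list)
    for model, answer in answers.items():
        groups[answer.strip().lower()].append(model)
    return {
        "all_agree": len(groups) == 1,
        "groups": {answer: sorted(models) for answer, models in groups.items()},
    }


# Usage with placeholder labels and answers:
result = triangulate({"model-a": "Yes", "model-b": "yes", "model-c": "no"})
print(result["all_agree"])  # False
print(result["groups"])     # {'yes': ['model-a', 'model-b'], 'no': ['model-c']}
```

Agreement between models is not verification. Two models can agree and both be wrong, so every group still has to be checked against your ground truth.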

📊 Activity 3: Trajectory Tracking

This is the meta-activity: testing the dated-research trap in practice. Pick a claim published in 2023 or 2024 of the form “AI cannot reliably do X” and check whether it still holds against current frontier models.

📝 Brief

Find a 2023 or 2024 paper, blog post, or news article that makes a specific empirical claim about an AI limitation. Quote the claim and name the model(s) it was tested on. Examples to start from (you should find your own):

Frieder et al. (2023): GPT-4 cannot do graduate-level mathematics — retest on GPT-5.5 Pro and Gemini Deep Think
Berglund et al. (2023): the reversal curse on GPT-3.5 / GPT-4 — retest on Opus 4.7 and DeepSeek V4 Pro
Magesh et al. (2024): purpose-built legal AI tools hallucinate on 17–33% of test queries. Check whether updated versions still do.
A claim from your own field, ideally one you've cited in your own work

Run the test yourself with current frontier models. Document the results. Reflect on:

Does the claim still hold? Partially? Has it been entirely superseded?
If superseded, what changed — better training, more parameters, different architecture, different prompting?
What did the original paper get right structurally, even if the specific empirical claim is now dated?
Should the original citation still appear in your literature review? With what framing?

Deliverable: a 400–600 word write-up of one trajectory test, with original claim, retest methodology, retest results, and a paragraph on how the original paper should now be cited.

📝 Weekly Assessment

📚 AI in [your field] — May 2026 Snapshot

Length: approximately 1,500 words

Due: end of Week 9 (specific date in Amathuba)

Format: structured report with sections for capability, limitation, methodology, and reflection

What the snapshot should contain:

Capability findings. Drawing on Activity 2, document what current frontier models can do in your field. Be specific: which model, which version, when tested, what prompts, what outputs, what verification you did. At least three documented capabilities.
Limitation findings. Drawing on Activity 1, document where current frontier models still fail in your field. Categorise by the 9.2 taxonomy. At least five documented failures.
Methodology. Document the verification protocols you used. Which models did you test? Which prompts? How did you verify the outputs? What did you learn about your own verification practice along the way?
Reflection: when this snapshot will go stale. Identify which of your findings are most likely to be superseded soonest. What kinds of evidence would prompt you to update the snapshot? What is your recommended retest cadence for your domain?
Disclosure statement (per Lesson 6 Sub-Lesson 5 format): which AI tools you used in producing the report itself, what for, and how you verified.

Grading split (100 points):

Capability findings — 25 points (specificity, dated evidence, verification quality)
Limitation findings — 25 points (specificity, taxonomy use, dated evidence)
Methodology — 25 points (rigour of verification, cross-model triangulation, manual spot-checks)
Reflection on staleness and retest cadence — 15 points
Disclosure statement — 10 points

🎯 Why the Deliverable Acknowledges Its Own Future Obsolescence

The point of the snapshot is not the snapshot itself. The point is the practice of producing it: knowing which models you tested, when, with what prompts, with what verification, and being able to reproduce the exercise in six months and notice what has changed.

A static body of knowledge about “what AI can do in my field” is not what we're asking for. We are asking for a calibrated, dated, retestable starting point that you can update as the field changes — and the meta-skill of knowing how to update it.
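
The "dated, retestable" part can be made concrete by attaching a test date and a retest interval to every finding, then asking which findings are due. This is a sketch under stated assumptions: the class, the field names, and the per-finding interval in days are illustrative choices, and the right cadence is the judgement call your reflection section asks you to make.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Finding:
    """One capability or limitation finding (field names are hypothetical)."""
    claim: str
    model: str
    tested_on: date
    retest_after_days: int  # your own cadence estimate for this finding

    def retest_due(self):
        """Date on which this finding should be rerun."""
        return self.tested_on + timedelta(days=self.retest_after_days)


def stale_findings(findings, today):
    """Return the findings whose retest date has passed, earliest first."""
    due = [f for f in findings if f.retest_due() <= today]
    return sorted(due, key=lambda f: f.retest_due())


# Usage with placeholder findings:
findings = [
    Finding("placeholder claim A", "model-x", date(2026, 5, 1), 90),
    Finding("placeholder claim B", "model-x", date(2026, 5, 1), 365),
]
for f in stale_findings(findings, today=date(2026, 8, 1)):
    print(f.claim, "was due on", f.retest_due().isoformat())
```

Giving findings different intervals reflects the taxonomy: a failure you judged structural may justify a longer interval than one you expect the next release to patch.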

📚 Week 9 in Summary

What you should leave the week with

A trajectory frame. Every claim about AI capability has an implicit “as of [date]”. From Sub-Lesson 9.1, you have a calibrated picture of where the May 2026 frontier sits.

A failure taxonomy. When you encounter an AI failure, you can locate it in patched / reduced-but-persistent / structural. From Sub-Lesson 9.2, you have the categories.

A capability map. From Sub-Lesson 9.3, you have concrete evidence of where AI is now genuinely strong — including in fields where two years ago the consensus said it couldn't contribute.

An epistemic frame that doesn't go stale. From Sub-Lesson 9.4, you have the Messeri & Crockett four-illusion framework as a durable lens on what happens to your own understanding when AI does more of the cognitive work.

Verification protocols. From Sub-Lesson 9.5, you have practical techniques for checking specific outputs and for reading capability claims critically.

Practice in your own field. From Sub-Lesson 9.6's activities and assessment, you have a documented snapshot of AI in your discipline as of May 2026, and the meta-skill to update it.

👉 What Comes Next

Week 10 — Agentic AI, RAG & Advanced Research Tools. Now that you have a calibrated picture of current capability and a verification habit in place, the next week introduces the more powerful agentic tools. The order matters: agentic AI amplifies both capability and risk. The verification literacy you built this week is what makes the next week safe to teach.